Biking has become a modern, eco-friendly, and comfortable mode of transport, and many people now use it for their daily commute. Bike-sharing has been widely promoted in metropolitan cities because it is not only a tool for hopping between places but also offers a relaxing, leisurely way to travel.
Seoul, the capital of South Korea, has its own bike-sharing system, as many cities do. The system provides its service via an app that gives registered users access to a city-wide bike fleet. Riders can rent a bike at one station and return it at any other station in the same network.
The Seoul bike-sharing system was launched in 2017, and its user base has grown rapidly. In the bike rental business, it is essential to keep enough bikes available to meet customer needs. Availability and accessibility therefore raise two questions: how can a bike rental business maintain a stable supply to meet riders' demand for bikes, and how can it use demand information to set pricing strategies (for example, a fixed price at peak times and a discounted price during off-peak periods)?
Predicting bike demand would help the business answer these questions, optimize its resources and operations, and provide a stable supply to riders. This project applies several machine learning models to predict demand for rental bikes. In addition, an hourly prediction of bike usage can be used to adjust rental prices according to demand, time, and calendar events.
image source: https://shorturl.at/btIOP
The dataset comes from Kaggle (source: https://www.kaggle.com/saurabhshahane/seoul-bike-sharing-demand-prediction). The original data set has 8,760 rows × 14 columns.
bikes <- read.csv("SeoulBikeData_Original.csv", header = TRUE,
fileEncoding = "Latin1", check.names = F)
head(bikes)
The variables, their types, and their descriptions are shown in the following table. The variables of the data set meet our research requirements.
Data variables and description:
| Variables | Type | Description |
|---|---|---|
| Date | DD/MM/YYYY | The date of the record (Dec 2017 – Nov 2018) |
| Rented Bike count | Discrete | The number of bikes rented for the hour of the data |
| Hour | Categorical | Each hour of the record for one day |
| Temperature(°C) | Continuous | The environment temperature in °C recorded |
| Humidity(%) | Continuous | The relative environment humidity in % |
| Wind speed (m/s) | Continuous | The speed that the air is moving in m/s |
| Visibility (10m) | Continuous | Distance (10 m scale) of clearness |
| Dew point temperature(°C) | Continuous | The temperature to which the air would have to cool to reach saturation |
| Solar Radiation (MJ/m2) | Continuous | The electromagnetic radiation emitted by the sun in MJ/m2 |
| Rainfall(mm) | Continuous | Height of the precipitation in mm |
| Snowfall (cm) | Continuous | Height of the snowfalls in cm |
| Seasons | Categorical | Spring, Summer, Autumn, Winter |
| Holiday | Categorical | Holiday, Workday (No Holiday) |
| Functioning Day | Categorical | Whether the rental bike system operates on that day (Yes) or not (No) |
The data were first inspected for NaN values and duplicated rows. The ‘Date’ column was split into ‘Year’, ‘Month’, ‘Day’, and ‘Weekday’ for analysis purposes. After processing, the data set contains 8,760 rows × 21 columns.
The weekday variable is coded as follows:
Day_of_Week:{ ‘Sunday’: 1, ‘Monday’: 2, ‘Tuesday’: 3, ‘Wednesday’: 4, ‘Thursday’: 5, ‘Friday’: 6, ‘Saturday’: 7}
First, we checked whether there’s any duplicated or missing data in the data set.
anyDuplicated(bikes)
## [1] 0
sum(is.na(bikes))
## [1] 0
Here we split the date into year, month, day, and day of week.
library(lubridate)
# helper: append ":00" to the hour value (the hour1 column is dropped again below)
Hourr <- function(x) {
  paste(x, ":00", sep = "")
}
bikes$hour1 <- unlist(lapply(bikes$Hour, Hourr))
bikes$Date <- strptime(as.character(bikes$Date),format="%d/%m/%Y")
bikes$year <- year(bikes$Date)
bikes$month <- month(bikes$Date)
bikes$day <- day(bikes$Date)
bikes$wday <- wday(bikes$Date)
bikes$fday <- bikes$'Functioning Day'
bikes$fday[which(bikes$fday=="Yes")] <- 1
bikes$fday[which(bikes$fday=="No")] <- 0
bikes <-subset(bikes,fday==1)
bikes$'Functioning Day' <- NULL
bikes$fday <- NULL
bikes$hour1 <- NULL
After transforming the date and time fields, we keep only the rows where ‘Functioning Day’ is “Yes”, since “No” means the service was not operating; all rows with ‘Functioning Day’ equal to 0 (did not operate) were deleted. Finally, because every remaining row has functioning day = 1, this column is also dropped to simplify the data set.
names(bikes)[2:13] <- c("bike_count", "hour", "temperature", "humidity",
                        "wind_speed", "visibility", "dew", "solar",
                        "rainfall", "snowfall", "season", "holiday")
bikes$season <- as.factor(bikes$season)
bikes$year <- as.factor(bikes$year)
bikes$month <- as.factor(bikes$month)
bikes$day <- as.factor(bikes$day)
bikes$wday <- as.factor(bikes$wday)
bikes$hour <- as.factor(bikes$hour)
bikes$holiday <- as.factor(bikes$holiday)
We renamed the columns to shorter terms for coding purposes and converted the categorical variables to factors.
bikes$highdemand <- bikes$bike_count
bikes$highdemand[bikes$bike_count >= 500] <- 1
bikes$highdemand[bikes$bike_count < 500] <- 0
bikes$highdemand <- as.factor(bikes$highdemand)
bikes <- bikes[c(1,14,15,16,17,3,2,4,5,6,7,8,9,10,11,12,13,18)]
str(bikes)
## 'data.frame': 8465 obs. of 18 variables:
## $ Date : POSIXlt, format: "2017-12-01" "2017-12-01" ...
## $ year : Factor w/ 2 levels "2017","2018": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wday : Factor w/ 7 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ bike_count : int 254 204 173 107 78 100 181 460 930 490 ...
## $ temperature: num -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
## $ humidity : int 37 38 39 40 36 37 35 38 37 27 ...
## $ wind_speed : num 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
## $ visibility : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
## $ dew : num -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
## $ solar : num 0 0 0 0 0 0 0 0 0.01 0.23 ...
## $ rainfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ snowfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ season : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ holiday : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
## $ highdemand : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
Using 500 as the threshold for high demand, hours with a bike count of at least 500 are marked ‘1’ (high demand) and hours below 500 are marked ‘0’ (not high demand). After reordering the columns, the data set was ready for the modeling steps.
library(caret)
library(dplyr)
set.seed(1)
indices <- createDataPartition(bikes$hour, p = 0.8, list = FALSE)
train <- bikes %>% slice(indices)
test <- bikes %>% slice(-indices)
#write.csv(train, file="train.csv", row.names = F)
#write.csv(test, file="test.csv", row.names = F)
Finally, we randomly split the data set, stratified by hour, into 80% for training models and 20% for testing them. (The data sets were exported as CSV files so team members could fit models on their own devices.)
The graphics below show how the data are spread across the continuous variables.
Here is the distribution of bike counts across the categorical variables.
According to the graphs, the predictors wday, day, and year do not seem very important. The boxplot of the year variable shows that neither 2017 nor 2018 is fully covered by the data. The predictors season, month, hour, and holiday appear to be important.
If the bike count is at least 500, the hour is categorized as high demand (1); below 500 it is low demand (0). The graphs below show how the high-demand class is distributed across the variables.
High demand appears to differ across most variables in the data set.
From a business perspective, it would be interesting to know which factors affect the total number of rented bikes. However, a linear model assumes a continuous response, so the bike count, a count variable, should not be used directly as the response here. Therefore, we transform the count data with the square-root function and fit a linear model.
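As a minimal sketch of this transform-and-back-transform idea (on synthetic data, not the project's data set): fit `lm()` on the square root of the counts, then square the predictions to return to the count scale, which also keeps them non-negative.

```r
# Square-root transform of a count response before fitting a linear model
# (synthetic data; the predictor x only stands in for something like temperature).
set.seed(1)
x <- runif(200, 0, 30)                       # temperature-like predictor
y <- rpois(200, lambda = exp(1 + 0.1 * x))   # synthetic hourly counts
fit <- lm(sqrt(y) ~ x)
pred <- predict(fit, newdata = data.frame(x = c(5, 25)))^2  # back-transform
pred  # predicted counts, non-negative by construction
```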
The relationship between bike count and temperature looks non-linear: the count increases with temperature up to about 30 °C and decreases beyond that point. Humidity, wind speed, and visibility show slight linear relationships with the count, but the effects are not obvious.
For dew point, solar radiation, rainfall, and snowfall, the linear relationships are also quite weak.
Season was used instead of month in the interaction plots for a clearer picture, but month would be used as the model variable if an interaction exists.
The lines in the season/holiday and holiday/hour interaction plots stay roughly parallel, so there is no sign of an interaction between these factors. However, the lines cross in the weekday/month plot, which suggests possible weekday:month and temperature:holiday interaction effects on bike count. We will consider these effects in the linear and GAM models.
We start by fitting a model containing all variables and use the drop1 function to identify variables that can be removed. As mentioned above, the bike count is square-root-transformed in order to fit the linear model.
lm.bikes.0 <- lm(sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility + dew
+ solar + rainfall + snowfall + month + holiday + hour + wday, data = train)
drop1(lm.bikes.0, test = "F")
## Single term deletions
##
## Model:
## sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility +
## dew + solar + rainfall + snowfall + month + holiday + hour +
## wday
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 224483 23841
## temperature 1 86 224569 23842 2.5830 0.1081
## humidity 1 7854 232337 24073 235.6326 < 2.2e-16 ***
## wind_speed 1 71 224554 23841 2.1217 0.1453
## visibility 1 638 225121 23859 19.1422 1.232e-05 ***
## dew 1 1877 226360 23896 56.3202 6.951e-14 ***
## solar 1 1040 225523 23871 31.2095 2.407e-08 ***
## rainfall 1 20033 244516 24419 601.0201 < 2.2e-16 ***
## snowfall 1 0 224484 23839 0.0105 0.9186
## month 11 59601 284084 25417 162.5598 < 2.2e-16 ***
## holiday 1 3581 228065 23947 107.4518 < 2.2e-16 ***
## hour 23 171322 395805 27643 223.4797 < 2.2e-16 ***
## wday 6 5724 230207 24000 28.6225 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Judging by the p-values, temperature, wind speed, and snowfall do not have a significant influence on bike count. Therefore, the model below removes these predictors.
lm.bikes.1 <- lm(sqrt(bike_count) ~ humidity + visibility + dew + solar
+ rainfall + holiday + hour + wday, data = train)
drop1(lm.bikes.1, test = "F")
## Single term deletions
##
## Model:
## sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall +
## holiday + hour + wday
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 285765 25451
## humidity 1 55747 341512 26658 1316.5820 < 2.2e-16 ***
## visibility 1 250 286016 25455 5.9152 0.01504 *
## dew 1 199817 485582 29046 4719.1240 < 2.2e-16 ***
## solar 1 2304 288069 25504 54.4183 1.814e-13 ***
## rainfall 1 20299 306065 25915 479.4118 < 2.2e-16 ***
## holiday 1 4415 290180 25553 104.2663 < 2.2e-16 ***
## hour 23 183778 469543 28774 188.7099 < 2.2e-16 ***
## wday 6 5376 291141 25565 21.1592 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to drop1(), every remaining variable is now significant. We next add interaction terms to the model and inspect it further.
lm.bikes.2 <- lm(sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall + holiday
+ hour + wday + temperature:holiday + wday:month, data = train)
drop1(lm.bikes.2, test = "F")
## Single term deletions
##
## Model:
## sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall +
## holiday + hour + wday + temperature:holiday + wday:month
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 210544 23536
## humidity 1 6176 216721 23730 195.666 < 2.2e-16 ***
## visibility 1 1689 212234 23589 53.518 2.863e-13 ***
## dew 1 1256 211801 23575 39.799 2.995e-10 ***
## solar 1 495 211039 23550 15.681 7.575e-05 ***
## rainfall 1 16563 227107 24048 524.699 < 2.2e-16 ***
## hour 23 172403 382948 27549 237.465 < 2.2e-16 ***
## holiday:temperature 2 772 211316 23557 12.228 5.004e-06 ***
## wday:month 77 73172 283716 25406 30.105 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, every term looks significant.
summary(lm.bikes.0)$r.squared
## [1] 0.7638281
summary(lm.bikes.1)$r.squared
## [1] 0.6993551
summary(lm.bikes.2)$r.squared
## [1] 0.7784927
anova(lm.bikes.0, lm.bikes.1)
## Analysis of Variance Table
##
## Model 1: sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility +
## dew + solar + rainfall + snowfall + month + holiday + hour +
## wday
## Model 2: sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall +
## holiday + hour + wday
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6735 224483
## 2 6749 285765 -14 -61282 131.33 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.bikes.0, lm.bikes.2)
## Analysis of Variance Table
##
## Model 1: sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility +
## dew + solar + rainfall + snowfall + month + holiday + hour +
## wday
## Model 2: sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall +
## holiday + hour + wday + temperature:holiday + wday:month
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6735 224483
## 2 6670 210544 65 13939 6.7935 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linear_model <- lm.bikes.2
Model 2 has the highest R-squared, followed by model 0 and model 1. The ANOVA tests show that both model 1 and model 2 differ significantly from model 0. Considering the R-squared values, we choose lm.bikes.2 for the model evaluation.
pred.lm <- predict(linear_model, newdata = test)
R.lm <- cor(pred.lm, test$bike_count)^2
R.lm
## [1] 0.6971688
Conclusion: the residual analysis graphs show a long tail on the left side of the QQ plot. While the in-sample R-squared was 78%, it dropped to 70% on the out-of-sample test. A linear model can explain about 70% of the variance, but it may not be the best model.
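The residual diagnostics referenced here can be reproduced with base R's `plot()` method for `lm` objects; the sketch below uses a synthetic fit standing in for lm.bikes.2, since the project data are not regenerated here.

```r
# Standard lm diagnostics: residuals vs. fitted, Normal Q-Q, scale-location,
# and leverage (synthetic data).
set.seed(1)
x <- runif(300, 0, 30)
y <- rpois(300, lambda = exp(1 + 0.1 * x))
fit <- lm(sqrt(y) ~ x)
par(mfrow = c(2, 2))
plot(fit)             # the Normal Q-Q panel is where a long tail shows up
par(mfrow = c(1, 1))
```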
library(mgcv)
library(forecast)
library(stats)
library(ggplot2)    # for ggplot() below
library(gridExtra)  # for grid.arrange() below
Plotting the data.
bikes$Date <- as.Date(bikes$Date, format = "%Y-%m-%d")
p1 <- ggplot(data = bikes, mapping = aes(y = bike_count, x = Date)) +
geom_point(color="gray25") + geom_smooth(color="brown3") + ylab("Rental Bike Count")
h2 <- ggplot(data = bikes, mapping = aes(x = bike_count)) + ggtitle("Histogram of Bikes Rentals") + geom_histogram(binwidth = 30)
grid.arrange(p1, h2, ncol = 2)
The distribution of bike rentals is skewed. However, we will not log-transform the response variable (bike rentals) because it contains zero values.
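The zero-count problem can be seen directly; `log1p()` is a common workaround (not used in this project, shown only for illustration), since log(1 + x) is defined at zero and exactly invertible via `expm1()`.

```r
# Why a plain log-transform fails on zero counts, and the log1p() alternative.
counts <- c(0, 1, 10, 100)
log(counts)[1]                # -Inf: plain log is undefined at 0
log1p(counts)                 # log(1 + x) handles zeros
back <- expm1(log1p(counts))  # exact inverse recovers the original counts
```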
The box plots in the 3.1 Graphic Analysis section indicate that the predictors wday, day, and year do not seem very important, while season, month, hour, and holiday appear to be important; the year boxplot also shows that neither 2017 nor 2018 is fully covered by the data.
Meanwhile, the scatter plots with smoothers in the linear model section show the response variable against each continuous predictor, to determine what kind of effect each predictor has on the response.
Based on the boxplots, the predictors season, holiday, hour, and month are expected to be important, and the effect of the continuous predictors may differ across these categorical variables. In the graph below, we inspect whether temperature interacts with season.
ggplot(data = bikes, mapping = aes(y = bike_count, x = temperature)) +
geom_point(color="gray25") + geom_smooth(color="brown3") + facet_wrap(. ~season)
None of the relationships appears linear. To estimate their shape, we will fit a GAM model and examine the estimated complexity (edf) of each smooth term.
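As a toy illustration of the edf values mgcv reports (synthetic data): a smooth fitted to a truly linear signal is penalised back to an edf near 1, while a wiggly signal needs a clearly larger edf.

```r
# edf intuition with mgcv: linear vs. wiggly signals (synthetic data).
library(mgcv)
set.seed(1)
x <- runif(400)
y_lin <- 2 * x + rnorm(400, sd = 0.1)
y_wig <- sin(8 * x) + rnorm(400, sd = 0.1)
summary(gam(y_lin ~ s(x)))$edf  # close to 1: effectively linear
summary(gam(y_wig ~ s(x)))$edf  # much larger: complex effect
```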
The starting GAM contains all continuous predictors as smooth terms. Since we are modeling count data, we use the "quasipoisson" family.
gam.0 <- gam(bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility) + s(dew)
+ s(solar) + s(rainfall) + s(snowfall), data = train, family = quasipoisson())
summary(gam.0)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility) +
## s(dew) + s(solar) + s(rainfall) + s(snowfall)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.291408 0.008926 704.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(temperature) 7.344 8.209 55.917 <2e-16 ***
## s(humidity) 8.864 8.990 33.166 <2e-16 ***
## s(wind_speed) 3.403 4.293 32.994 <2e-16 ***
## s(visibility) 1.932 2.403 4.317 0.0107 *
## s(dew) 8.291 8.786 13.313 <2e-16 ***
## s(solar) 8.778 8.983 91.108 <2e-16 ***
## s(rainfall) 8.195 8.798 23.817 <2e-16 ***
## s(snowfall) 1.001 1.001 2.020 0.1552
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.612 Deviance explained = 64.5%
## GCV = 192.62 Scale est. = 203.89 n = 6785
With a strict 5% threshold, there is no evidence that the smooth term of snowfall plays a role (p-value = 0.155). According to the edf values, none of the strongly significant smooth terms has a linear effect; most have very complex effects (edf above 7). We remove s(snowfall).
gam.1 <- gam(bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility)
+ s(dew) + s(solar) + s(rainfall), data = train, family = quasipoisson())
summary(gam.1)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility) +
## s(dew) + s(solar) + s(rainfall)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.291901 0.008919 705.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(temperature) 7.342 8.207 57.537 <2e-16 ***
## s(humidity) 8.860 8.990 34.060 <2e-16 ***
## s(wind_speed) 3.406 4.296 32.798 <2e-16 ***
## s(visibility) 1.932 2.403 4.296 0.011 *
## s(dew) 8.329 8.805 13.463 <2e-16 ***
## s(solar) 8.780 8.983 90.763 <2e-16 ***
## s(rainfall) 8.191 8.796 23.734 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.612 Deviance explained = 64.5%
## GCV = 192.63 Scale est. = 204.26 n = 6785
All predictors are strongly significant (p-values < 5%). We compare the two models above with the function anova().
anova(gam.0, gam.1, test="F")$Pr # p-value
## [1] NA 0.1608531
The output shows no strong evidence that the first (larger) model fits the data better. Thus, we continue with the less complex model.
According to the graphical analysis, the factors season, month, holiday, hour, and wday seemed important, so we add these predictors to the model. However, since season and month are closely related, we choose only month, which gives a more fine-grained view.
gam.2 <- gam(bike_count ~ month + s(temperature) + s(humidity) + s(wind_speed) + s(visibility)
+ s(dew) + s(solar) + s(rainfall), data = train, family = quasipoisson())
summary(gam.2)$r.sq
## [1] 0.64882
anova(gam.1, gam.2, test="F")$Pr
## [1] NA 1.611917e-143
Adding the categorical variable hour to our model.
gam.3 <- gam(bike_count ~ hour + month + s(temperature) + s(humidity) + s(wind_speed) +
s(visibility) + s(dew) + s(solar) + s(rainfall),
data = train, family = quasipoisson())
capture.output(summary(gam.3))[54:61]
## [1] "s(visibility) 1.434 1.746 1.993 0.213747 "
## [2] "s(dew) 7.027 7.962 6.755 9.21e-09 ***"
## [3] "s(solar) 5.807 6.932 3.973 0.000242 ***"
## [4] "s(rainfall) 8.757 8.978 86.217 < 2e-16 ***"
## [5] "---"
## [6] "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [7] ""
## [8] "R-sq.(adj) = 0.853 Deviance explained = 86.6%"
anova(gam.2, gam.3, test="F")$Pr
## [1] NA 0
The smooth term of visibility is no longer significant (p-value = 0.214 > 5%), so we drop visibility.
gam.4 <- gam(bike_count ~ hour + month + s(temperature) + s(humidity) + s(wind_speed) + s(dew)
+ s(solar) + s(rainfall), data = train, family = quasipoisson())
#summary(gam.4)
anova(gam.3, gam.4, test="F")$Pr
## [1] NA 0.151628
We continue with the less complex model (gam.4) and add the categorical variable holiday.
gam.5 <- gam(bike_count ~ hour + holiday + month + s(temperature) + s(humidity) + s(wind_speed)
+ s(dew) + s(solar) + s(rainfall), data = train, family = quasipoisson())
summary(gam.5)$r.sq
## [1] 0.8550654
anova(gam.4, gam.5, test="F")$Pr
## [1] NA NA
Adding the categorical variable wday.
gam.6 <- gam(bike_count ~ hour + holiday + month + wday + s(temperature) + s(humidity) +
s(wind_speed) + s(dew) + s(solar) + s(rainfall),
data = train, family = quasipoisson())
summary(gam.6)$r.sq
## [1] 0.8664235
anova(gam.5, gam.6, test="F")$Pr
## [1] NA 1.874588e-66
All predictors of gam.5 and gam.6 are strongly significant (p-values < 5%). The higher adjusted R-squared and the ANOVA output give strong evidence that the model with more parameters (gam.6) fits the data better.
In the graphs, we have seen that temperature appeared to interact with categorical variables. We will inspect whether a different smooth term for temperature is needed for each month.
gam.7 <- gam(bike_count ~ hour + holiday + month + wday + s(temperature, by = month) +
s(humidity)+ s(wind_speed) + s(dew) + s(solar) + s(rainfall),
data = train, family = quasipoisson())
capture.output(summary(gam.7))[78:80]
## [1] "R-sq.(adj) = 0.87 Deviance explained = 87.7%"
## [2] "GCV = 67.963 Scale est. = 65.366 n = 6785"
## [3] NA
anova(gam.6, gam.7, test="F")$Pr
## [1] NA 4.724427e-21
The last model shows the highest value of an adjusted R-squared (0.87). Moreover, the ANOVA output indicates that there is strong evidence that the gam.7 model better fits the data.
Make predictions on the test data for the models with high R-squared values.
pred.gam5 <- predict.gam(gam.5, newdata = test)
pred.gam6 <- predict.gam(gam.6, newdata = test)
pred.gam <- predict.gam(gam.7, newdata = test)
Compute R-squared to evaluate out-of-sample performance.
cor(pred.gam5, test$bike_count)^2
## [1] 0.6716468
cor(pred.gam6, test$bike_count)^2
## [1] 0.6749364
cor(pred.gam, test$bike_count)^2
## [1] 0.6758239
R.gam <- cor(pred.gam, test$bike_count)^2
Conclusion: the out-of-sample R-squared of the most complex model dropped markedly from the in-sample value (87% to 68%), and the same holds for the other models. However, the most complex model still performs best (highest R-squared), so we use gam.7 for the final predictions.
library(car)
library("performance")
boxplot(bike_count ~ month, data = bikes, col=c('azure3', 'azure3','mistyrose',
'mistyrose','mistyrose','powderblue','powderblue', 'powderblue',
'darkgoldenrod2','darkgoldenrod2','darkgoldenrod2', 'azure3'))
abline(h=500,lwd=2,lty=2, col="blue")
The graph shows that the number of rented bikes between April and November on non-holidays is typically above 500, which is close to the median (524).
This model predicts the maximum number of bikes needed. The response variable, the number of rented bikes, is a non-negative integer count, so a Poisson model is applied. Since the number of rentals from April to November on non-holidays is relatively high, we use only the data meeting these conditions (dates without service were already removed). The same filtering is applied to both the train and test data sets, named train.p and test.p.
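A minimal Poisson-GLM sketch (synthetic data, not the project's): with the log link, fitted counts are guaranteed non-negative, and `predict(..., type = "response")` returns predictions on the count scale rather than the log (link) scale.

```r
# Poisson GLM on synthetic counts: link scale vs. response scale.
set.seed(1)
x <- runif(100, 0, 10)
y <- rpois(100, lambda = exp(0.5 + 0.2 * x))
fit <- glm(y ~ x, family = poisson)
p_link <- predict(fit, data.frame(x = 5))                     # log scale
p_resp <- predict(fit, data.frame(x = 5), type = "response")  # count scale
```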
train.p <- subset(train, month %in% c(4,5,6,7,8,9,10,11))
train.p <- subset(train.p, holiday=="No Holiday")
train.p <- subset(train.p, select = c(-Date, -year, -day, -wday, -season,
-holiday, -season, -highdemand))
test.p <- subset(test, month %in% c(4,5,6,7,8,9,10,11))
test.p <-subset(test.p, holiday=="No Holiday")
test.p <-subset(test.p, select = c(-Date, -year, -day, -wday, -season,
-holiday, -season, -highdemand))
glm.full0 <- glm(bike_count ~ month + hour + temperature + humidity + wind_speed +
visibility + dew + solar + rainfall + snowfall, family = "poisson",
data = train.p)
summary(glm.full0)
##
## Call:
## glm(formula = bike_count ~ month + hour + temperature + humidity +
## wind_speed + visibility + dew + solar + rainfall + snowfall,
## family = "poisson", data = train.p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -53.489 -6.047 0.427 5.934 101.484
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.187e+00 1.104e-02 650.730 < 2e-16 ***
## month5 2.103e-01 2.326e-03 90.432 < 2e-16 ***
## month6 3.315e-01 2.694e-03 123.031 < 2e-16 ***
## month7 2.647e-02 3.571e-03 7.413 1.23e-13 ***
## month8 -1.039e-01 3.741e-03 -27.763 < 2e-16 ***
## month9 2.158e-01 2.908e-03 74.195 < 2e-16 ***
## month10 2.615e-01 2.201e-03 118.802 < 2e-16 ***
## month11 8.230e-02 2.539e-03 32.417 < 2e-16 ***
## hour1 -2.600e-01 4.143e-03 -62.761 < 2e-16 ***
## hour2 -5.925e-01 4.639e-03 -127.722 < 2e-16 ***
## hour3 -9.289e-01 5.327e-03 -174.365 < 2e-16 ***
## hour4 -1.350e+00 6.169e-03 -218.877 < 2e-16 ***
## hour5 -1.291e+00 6.171e-03 -209.161 < 2e-16 ***
## hour6 -5.290e-01 4.618e-03 -114.548 < 2e-16 ***
## hour7 1.331e-01 3.843e-03 34.633 < 2e-16 ***
## hour8 6.088e-01 3.498e-03 174.058 < 2e-16 ***
## hour9 1.166e-01 3.954e-03 29.490 < 2e-16 ***
## hour10 -1.954e-01 4.308e-03 -45.363 < 2e-16 ***
## hour11 -1.610e-01 4.435e-03 -36.290 < 2e-16 ***
## hour12 -3.026e-02 4.477e-03 -6.759 1.39e-11 ***
## hour13 -1.259e-02 4.441e-03 -2.834 0.0046 **
## hour14 -4.752e-02 4.387e-03 -10.832 < 2e-16 ***
## hour15 9.442e-02 4.150e-03 22.751 < 2e-16 ***
## hour16 2.138e-01 3.888e-03 55.003 < 2e-16 ***
## hour17 4.935e-01 3.612e-03 136.637 < 2e-16 ***
## hour18 8.234e-01 3.352e-03 245.619 < 2e-16 ***
## hour19 6.727e-01 3.368e-03 199.752 < 2e-16 ***
## hour20 6.071e-01 3.382e-03 179.495 < 2e-16 ***
## hour21 6.001e-01 3.354e-03 178.909 < 2e-16 ***
## hour22 5.033e-01 3.437e-03 146.443 < 2e-16 ***
## hour23 2.238e-01 3.658e-03 61.164 < 2e-16 ***
## temperature -5.449e-03 4.030e-04 -13.520 < 2e-16 ***
## humidity -1.378e-02 1.143e-04 -120.599 < 2e-16 ***
## wind_speed -1.413e-02 6.101e-04 -23.155 < 2e-16 ***
## visibility 3.727e-05 1.316e-06 28.320 < 2e-16 ***
## dew 2.725e-02 4.062e-04 67.078 < 2e-16 ***
## solar 2.612e-02 1.277e-03 20.453 < 2e-16 ***
## rainfall -4.973e-01 2.387e-03 -208.318 < 2e-16 ***
## snowfall -1.409e-01 3.153e-03 -44.673 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2092688 on 4309 degrees of freedom
## Residual deviance: 530005 on 4271 degrees of freedom
## AIC: 565902
##
## Number of Fisher Scoring iterations: 7
We build the first model (glm.full0) with all variables; all of them are significant.
vif(glm.full0)
## GVIF Df GVIF^(1/(2*Df))
## month 11.115811 7 1.187709
## hour 8.676991 23 1.048092
## temperature 43.321507 1 6.581908
## humidity 17.568770 1 4.191512
## wind_speed 1.428760 1 1.195308
## visibility 1.878927 1 1.370739
## dew 52.604668 1 7.252908
## solar 6.269015 1 2.503800
## rainfall 1.118601 1 1.057639
## snowfall 1.045478 1 1.022486
Measuring collinearity shows that the variables "month", "temperature", "humidity", and "dew" have serious collinearity problems. The variables "hour" and "solar" show collinearity as well (GVIFs above 5).
Remove ‘humidity’: the humidity in the data is relative humidity, the percentage of saturation at a given temperature, which depends on both moisture content and temperature; it can be computed from temperature and dew point temperature. The dew point is the temperature at which the air becomes saturated (100% relative humidity) and depends only on the amount of moisture in the air. To avoid collinearity, ‘humidity’ is removed.
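The relationship just described can be illustrated with the Magnus approximation; the constants 17.625 and 243.04 are one common parameterisation, assumed here purely for illustration. Relative humidity then depends only on temperature and dew point.

```r
# Relative humidity from temperature and dew point (Magnus approximation;
# constants are an assumed, commonly used parameterisation).
rh_from_dew <- function(temp_c, dew_c) {
  es <- function(t) exp(17.625 * t / (243.04 + t))  # relative saturation vapour pressure
  100 * es(dew_c) / es(temp_c)
}
rh_from_dew(20, 20)   # air saturated at the dew point: 100%
rh_from_dew(20, 9.3)  # roughly 50% relative humidity at 20 degrees C
```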
glm.full1 <- glm(bike_count ~ month + hour + temperature + wind_speed +
visibility + dew + solar + rainfall + snowfall,
family = "poisson", data = train.p)
summary(glm.full1)
##
## Call:
## glm(formula = bike_count ~ month + hour + temperature + wind_speed +
## visibility + dew + solar + rainfall + snowfall, family = "poisson",
## data = train.p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -54.707 -6.115 0.562 5.997 107.767
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.960e+00 4.306e-03 1384.029 < 2e-16 ***
## month5 2.299e-01 2.321e-03 99.093 < 2e-16 ***
## month6 3.455e-01 2.696e-03 128.176 < 2e-16 ***
## month7 3.701e-02 3.572e-03 10.362 < 2e-16 ***
## month8 -9.600e-02 3.739e-03 -25.674 < 2e-16 ***
## month9 2.266e-01 2.910e-03 77.880 < 2e-16 ***
## month10 2.816e-01 2.199e-03 128.060 < 2e-16 ***
## month11 1.133e-01 2.525e-03 44.892 < 2e-16 ***
## hour1 -2.597e-01 4.143e-03 -62.677 < 2e-16 ***
## hour2 -5.911e-01 4.639e-03 -127.417 < 2e-16 ***
## hour3 -9.305e-01 5.328e-03 -174.667 < 2e-16 ***
## hour4 -1.350e+00 6.169e-03 -218.889 < 2e-16 ***
## hour5 -1.294e+00 6.172e-03 -209.709 < 2e-16 ***
## hour6 -5.294e-01 4.619e-03 -114.621 < 2e-16 ***
## hour7 1.451e-01 3.839e-03 37.799 < 2e-16 ***
## hour8 6.159e-01 3.496e-03 176.179 < 2e-16 ***
## hour9 1.376e-01 3.946e-03 34.880 < 2e-16 ***
## hour10 -1.797e-01 4.300e-03 -41.795 < 2e-16 ***
## hour11 -1.462e-01 4.424e-03 -33.045 < 2e-16 ***
## hour12 -2.407e-02 4.471e-03 -5.383 7.33e-08 ***
## hour13 -1.070e-02 4.436e-03 -2.411 0.0159 *
## hour14 -5.369e-02 4.383e-03 -12.250 < 2e-16 ***
## hour15 8.842e-02 4.147e-03 21.321 < 2e-16 ***
## hour16 2.131e-01 3.882e-03 54.882 < 2e-16 ***
## hour17 4.973e-01 3.609e-03 137.807 < 2e-16 ***
## hour18 8.353e-01 3.349e-03 249.442 < 2e-16 ***
## hour19 6.918e-01 3.362e-03 205.772 < 2e-16 ***
## hour20 6.255e-01 3.378e-03 185.157 < 2e-16 ***
## hour21 6.167e-01 3.351e-03 184.001 < 2e-16 ***
## hour22 5.164e-01 3.435e-03 150.341 < 2e-16 ***
## hour23 2.321e-01 3.658e-03 63.447 < 2e-16 ***
## temperature 3.651e-02 1.967e-04 185.625 < 2e-16 ***
## wind_speed -1.238e-02 6.089e-04 -20.330 < 2e-16 ***
## visibility 6.444e-05 1.292e-06 49.873 < 2e-16 ***
## dew -1.751e-02 1.554e-04 -112.709 < 2e-16 ***
## solar 2.950e-02 1.273e-03 23.170 < 2e-16 ***
## rainfall -5.537e-01 2.427e-03 -228.137 < 2e-16 ***
## snowfall -1.680e-01 3.159e-03 -53.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2092688 on 4309 degrees of freedom
## Residual deviance: 543634 on 4272 degrees of freedom
## AIC: 579529
##
## Number of Fisher Scoring iterations: 7
All variables are significant.
vif(glm.full1)
## GVIF Df GVIF^(1/(2*Df))
## month 10.925637 7 1.186246
## hour 8.452195 23 1.047494
## temperature 10.281605 1 3.206494
## wind_speed 1.427537 1 1.194796
## visibility 1.826441 1 1.351459
## dew 7.752833 1 2.784391
## solar 6.261017 1 2.502202
## rainfall 1.080463 1 1.039453
## snowfall 1.040608 1 1.020102
The severity of collinearity has decreased; however, the GVIF values of "month" and "temperature" are still higher than 10. Temperature naturally varies with month, so a correlation between the two is expected. At the same time, "month" also captures seasonal effects such as the summer vacation, which may influence rental counts independently of temperature. Both variables are therefore kept.
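To make the month/temperature dependence concrete, a quick check of the monthly mean temperature in the training data could look like this (a sketch, not part of the original pipeline; it assumes `train.p` with the column names used above is in memory):

```r
# Mean temperature per month: a strong seasonal pattern explains the
# elevated GVIF values of 'month' and 'temperature'.
round(tapply(train.p$temperature, train.p$month, mean), 1)

# One-way ANOVA: how much of the temperature variance does 'month' explain?
summary(aov(temperature ~ month, data = train.p))
```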
drop1(glm.full1, test = "F")
## Single term deletions
##
## Model:
## bike_count ~ month + hour + temperature + wind_speed + visibility +
## dew + solar + rainfall + snowfall
## Df Deviance AIC F value Pr(>F)
## <none> 543634 579529
## month 7 624543 660424 90.8287 < 2.2e-16 ***
## hour 23 1259771 1295619 244.6768 < 2.2e-16 ***
## temperature 1 578264 614157 272.1298 < 2.2e-16 ***
## wind_speed 1 544048 579941 3.2569 0.07119 .
## visibility 1 546127 582019 19.5880 9.847e-06 ***
## dew 1 556263 592156 99.2445 < 2.2e-16 ***
## solar 1 544172 580064 4.2251 0.03989 *
## rainfall 1 679300 715192 1066.0917 < 2.2e-16 ***
## snowfall 1 547260 583153 28.4985 9.864e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glm_w <- glm(bike_count ~ month + hour + temperature +
visibility + dew + solar + rainfall + snowfall,
family = "poisson", data = train.p)
capture.output(summary(glm_w))[56]
## [1] "AIC: 579941"
Excluding "wind_speed" did not decrease the AIC, so the variable is kept. The graphical analysis also suggests that temperature does not have a linear effect; a quadratic or cubic term seems more appropriate.
glm.full2 <- glm(bike_count ~ month + hour + poly(temperature, degree=2) + wind_speed +
visibility + dew + solar + rainfall + snowfall,
family = "poisson", data = train.p)
summary(glm.full2)
##
## Call:
## glm(formula = bike_count ~ month + hour + poly(temperature, degree = 2) +
## wind_speed + visibility + dew + solar + rainfall + snowfall,
## family = "poisson", data = train.p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -51.852 -5.413 0.356 5.532 108.786
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.593e+00 4.204e-03 1568.28 <2e-16 ***
## month5 2.010e-01 2.311e-03 86.95 <2e-16 ***
## month6 3.608e-01 2.677e-03 134.78 <2e-16 ***
## month7 1.990e-01 3.617e-03 55.02 <2e-16 ***
## month8 9.300e-02 3.804e-03 24.45 <2e-16 ***
## month9 2.321e-01 2.889e-03 80.34 <2e-16 ***
## month10 3.075e-01 2.209e-03 139.22 <2e-16 ***
## month11 2.649e-01 2.643e-03 100.23 <2e-16 ***
## hour1 -2.504e-01 4.143e-03 -60.43 <2e-16 ***
## hour2 -5.893e-01 4.639e-03 -127.05 <2e-16 ***
## hour3 -9.207e-01 5.327e-03 -172.82 <2e-16 ***
## hour4 -1.340e+00 6.169e-03 -217.28 <2e-16 ***
## hour5 -1.281e+00 6.171e-03 -207.66 <2e-16 ***
## hour6 -5.148e-01 4.618e-03 -111.47 <2e-16 ***
## hour7 1.426e-01 3.838e-03 37.16 <2e-16 ***
## hour8 5.981e-01 3.501e-03 170.85 <2e-16 ***
## hour9 9.245e-02 3.961e-03 23.34 <2e-16 ***
## hour10 -2.509e-01 4.329e-03 -57.97 <2e-16 ***
## hour11 -2.298e-01 4.464e-03 -51.48 <2e-16 ***
## hour12 -1.121e-01 4.515e-03 -24.82 <2e-16 ***
## hour13 -9.823e-02 4.481e-03 -21.92 <2e-16 ***
## hour14 -1.135e-01 4.411e-03 -25.74 <2e-16 ***
## hour15 3.707e-02 4.166e-03 8.90 <2e-16 ***
## hour16 1.800e-01 3.887e-03 46.31 <2e-16 ***
## hour17 4.869e-01 3.607e-03 135.00 <2e-16 ***
## hour18 8.398e-01 3.345e-03 251.11 <2e-16 ***
## hour19 6.970e-01 3.359e-03 207.50 <2e-16 ***
## hour20 6.320e-01 3.377e-03 187.16 <2e-16 ***
## hour21 6.199e-01 3.351e-03 184.98 <2e-16 ***
## hour22 5.165e-01 3.435e-03 150.36 <2e-16 ***
## hour23 2.309e-01 3.658e-03 63.13 <2e-16 ***
## poly(temperature, degree = 2)1 2.168e+01 1.103e-01 196.62 <2e-16 ***
## poly(temperature, degree = 2)2 -8.864e+00 4.514e-02 -196.36 <2e-16 ***
## wind_speed -6.804e-03 6.088e-04 -11.18 <2e-16 ***
## visibility 6.233e-05 1.289e-06 48.34 <2e-16 ***
## dew -1.994e-02 1.547e-04 -128.86 <2e-16 ***
## solar 7.580e-02 1.299e-03 58.34 <2e-16 ***
## rainfall -5.637e-01 2.446e-03 -230.48 <2e-16 ***
## snowfall -9.723e-02 3.130e-03 -31.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2092688 on 4309 degrees of freedom
## Residual deviance: 503687 on 4271 degrees of freedom
## AIC: 539584
##
## Number of Fisher Scoring iterations: 7
glm.full3 <- glm(bike_count ~ month + hour + poly(temperature, degree=3) + wind_speed +
visibility + dew + solar + rainfall + snowfall,
family = "poisson", data = train.p)
summary(glm.full3)
##
## Call:
## glm(formula = bike_count ~ month + hour + poly(temperature, degree = 3) +
## wind_speed + visibility + dew + solar + rainfall + snowfall,
## family = "poisson", data = train.p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -53.206 -5.242 0.370 5.350 108.371
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.653e+00 4.224e-03 1575.027 < 2e-16 ***
## month5 1.565e-01 2.340e-03 66.860 < 2e-16 ***
## month6 2.482e-01 2.789e-03 88.990 < 2e-16 ***
## month7 8.645e-02 3.697e-03 23.382 < 2e-16 ***
## month8 -1.215e-02 3.863e-03 -3.145 0.00166 **
## month9 1.281e-01 2.985e-03 42.919 < 2e-16 ***
## month10 3.331e-01 2.221e-03 149.998 < 2e-16 ***
## month11 2.542e-01 2.667e-03 95.315 < 2e-16 ***
## hour1 -2.518e-01 4.143e-03 -60.775 < 2e-16 ***
## hour2 -5.866e-01 4.639e-03 -126.449 < 2e-16 ***
## hour3 -9.237e-01 5.327e-03 -173.384 < 2e-16 ***
## hour4 -1.344e+00 6.169e-03 -217.800 < 2e-16 ***
## hour5 -1.287e+00 6.172e-03 -208.483 < 2e-16 ***
## hour6 -5.168e-01 4.618e-03 -111.901 < 2e-16 ***
## hour7 1.426e-01 3.838e-03 37.152 < 2e-16 ***
## hour8 5.905e-01 3.500e-03 168.713 < 2e-16 ***
## hour9 8.845e-02 3.958e-03 22.345 < 2e-16 ***
## hour10 -2.505e-01 4.327e-03 -57.887 < 2e-16 ***
## hour11 -2.286e-01 4.462e-03 -51.230 < 2e-16 ***
## hour12 -1.181e-01 4.514e-03 -26.162 < 2e-16 ***
## hour13 -1.008e-01 4.482e-03 -22.487 < 2e-16 ***
## hour14 -1.151e-01 4.413e-03 -26.079 < 2e-16 ***
## hour15 3.297e-02 4.167e-03 7.912 2.53e-15 ***
## hour16 1.770e-01 3.886e-03 45.564 < 2e-16 ***
## hour17 4.815e-01 3.605e-03 133.568 < 2e-16 ***
## hour18 8.350e-01 3.343e-03 249.737 < 2e-16 ***
## hour19 6.942e-01 3.358e-03 206.745 < 2e-16 ***
## hour20 6.248e-01 3.376e-03 185.046 < 2e-16 ***
## hour21 6.148e-01 3.351e-03 183.475 < 2e-16 ***
## hour22 5.117e-01 3.435e-03 148.976 < 2e-16 ***
## hour23 2.298e-01 3.658e-03 62.816 < 2e-16 ***
## poly(temperature, degree = 3)1 2.335e+01 1.103e-01 211.659 < 2e-16 ***
## poly(temperature, degree = 3)2 -7.265e+00 4.567e-02 -159.057 < 2e-16 ***
## poly(temperature, degree = 3)3 -5.691e+00 4.033e-02 -141.129 < 2e-16 ***
## wind_speed -1.037e-02 6.087e-04 -17.034 < 2e-16 ***
## visibility 6.394e-05 1.290e-06 49.548 < 2e-16 ***
## dew -1.920e-02 1.548e-04 -124.062 < 2e-16 ***
## solar 7.049e-02 1.298e-03 54.313 < 2e-16 ***
## rainfall -5.584e-01 2.437e-03 -229.179 < 2e-16 ***
## snowfall -1.494e-01 3.175e-03 -47.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2092688 on 4309 degrees of freedom
## Residual deviance: 484023 on 4270 degrees of freedom
## AIC: 519922
##
## Number of Fisher Scoring iterations: 7
After adding the cubic term, model "glm.full3" achieves a lower AIC (519,922 vs. 539,584 for the quadratic model).
A Poisson GLM assumes that the mean and variance of the response variable are equal (equidispersion). If the residual deviance of the fitted model is much larger than the residual degrees of freedom, the model is overdispersed; overdispersion indicates that a Poisson distribution does not adequately model the variance and is not appropriate for the analysis. For model "glm.full3" the ratio computed below is about 113, far above the expected value of 1, which is clearly problematic.
# Dispersion statistic = residual deviance / residual degrees of freedom
ods <- glm.full3$deviance / glm.full3$df.residual
ods
## [1] 113.3544
glm.full3q <- glm(bike_count ~ month + hour + poly(temperature, degree=3) + wind_speed +
visibility + dew + solar + rainfall + snowfall,
family = "quasipoisson", data = train.p)
#summary(glm.full3q)
capture.output(summary(glm.full3q))[12:52]
## [1] " Estimate Std. Error t value Pr(>|t|)"
## [2] "(Intercept) 6.653e+00 6.506e+00 1.023 0.307"
## [3] "month5 1.565e-01 3.604e+00 0.043 0.965"
## [4] "month6 2.482e-01 4.296e+00 0.058 0.954"
## [5] "month7 8.645e-02 5.695e+00 0.015 0.988"
## [6] "month8 -1.215e-02 5.950e+00 -0.002 0.998"
## [7] "month9 1.281e-01 4.598e+00 0.028 0.978"
## [8] "month10 3.331e-01 3.421e+00 0.097 0.922"
## [9] "month11 2.542e-01 4.108e+00 0.062 0.951"
## [10] "hour1 -2.518e-01 6.381e+00 -0.039 0.969"
## [11] "hour2 -5.866e-01 7.145e+00 -0.082 0.935"
## [12] "hour3 -9.237e-01 8.205e+00 -0.113 0.910"
## [13] "hour4 -1.344e+00 9.502e+00 -0.141 0.888"
## [14] "hour5 -1.287e+00 9.506e+00 -0.135 0.892"
## [15] "hour6 -5.168e-01 7.113e+00 -0.073 0.942"
## [16] "hour7 1.426e-01 5.911e+00 0.024 0.981"
## [17] "hour8 5.905e-01 5.391e+00 0.110 0.913"
## [18] "hour9 8.845e-02 6.097e+00 0.015 0.988"
## [19] "hour10 -2.505e-01 6.664e+00 -0.038 0.970"
## [20] "hour11 -2.286e-01 6.873e+00 -0.033 0.973"
## [21] "hour12 -1.181e-01 6.953e+00 -0.017 0.986"
## [22] "hour13 -1.008e-01 6.903e+00 -0.015 0.988"
## [23] "hour14 -1.151e-01 6.797e+00 -0.017 0.986"
## [24] "hour15 3.297e-02 6.417e+00 0.005 0.996"
## [25] "hour16 1.770e-01 5.985e+00 0.030 0.976"
## [26] "hour17 4.815e-01 5.553e+00 0.087 0.931"
## [27] "hour18 8.350e-01 5.149e+00 0.162 0.871"
## [28] "hour19 6.942e-01 5.172e+00 0.134 0.893"
## [29] "hour20 6.248e-01 5.200e+00 0.120 0.904"
## [30] "hour21 6.148e-01 5.161e+00 0.119 0.905"
## [31] "hour22 5.117e-01 5.290e+00 0.097 0.923"
## [32] "hour23 2.298e-01 5.634e+00 0.041 0.967"
## [33] "poly(temperature, degree = 3)1 2.335e+01 1.699e+02 0.137 0.891"
## [34] "poly(temperature, degree = 3)2 -7.265e+00 7.035e+01 -0.103 0.918"
## [35] "poly(temperature, degree = 3)3 -5.691e+00 6.211e+01 -0.092 0.927"
## [36] "wind_speed -1.037e-02 9.376e-01 -0.011 0.991"
## [37] "visibility 6.394e-05 1.987e-03 0.032 0.974"
## [38] "dew -1.920e-02 2.384e-01 -0.081 0.936"
## [39] "solar 7.049e-02 1.999e+00 0.035 0.972"
## [40] "rainfall -5.584e-01 3.753e+00 -0.149 0.882"
## [41] "snowfall -1.494e-01 4.890e+00 -0.031 0.976"
capture.output(summary(glm.full1))[58]
## [1] "AIC: 579529"
drop1(glm.full3q,test="F")
## Single term deletions
##
## Model:
## bike_count ~ month + hour + poly(temperature, degree = 3) + wind_speed +
## visibility + dew + solar + rainfall + snowfall
## Df Deviance F value Pr(>F)
## <none> 484023
## month 7 524092 50.4970 < 2.2e-16 ***
## hour 23 1207089 277.3392 < 2.2e-16 ***
## poly(temperature, degree = 3) 3 578264 277.1263 < 2.2e-16 ***
## wind_speed 1 484314 2.5656 0.1093
## visibility 1 486484 21.7062 3.274e-06 ***
## dew 1 499307 134.8271 < 2.2e-16 ***
## solar 1 486985 26.1297 3.332e-07 ***
## rainfall 1 623286 1228.5614 < 2.2e-16 ***
## snowfall 1 486757 24.1177 9.401e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To account for the overdispersion, the model is refit with the "quasipoisson" family. The standard errors inflate substantially, so most p-values become large (p > 0.05) and all individual coefficients appear insignificant. However, when the quasipoisson model is assessed with an F-test via drop1(), only the p-value of "wind_speed" exceeds 0.05. Since these two assessments disagree so strongly, a negative binomial model is tried next.
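As a cross-check on the negative binomial route taken below, the classical `glm.nb()` from the MASS package fits the same mean-variance relationship; this is a sketch, not part of the original pipeline, and assumes `train.p` is available:

```r
library(MASS)  # provides glm.nb()

# Negative binomial GLM with the same predictors as glm.full3;
# 'theta' is the estimated dispersion parameter, comparable to the
# overdispersion parameter reported by glmmTMB's nbinom2 family.
glm.nb.check <- glm.nb(bike_count ~ month + hour + poly(temperature, degree = 3) +
                         wind_speed + visibility + dew + solar + rainfall + snowfall,
                       data = train.p)
glm.nb.check$theta
```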
library(glmmTMB)
glm.full3nb0 <- glmmTMB(bike_count ~ month + hour + poly(temperature, degree=3)
+ wind_speed + visibility + dew + solar + rainfall+ snowfall,
data = train.p ,family="nbinom2")
summary(glm.full3nb0)
## Family: nbinom2 ( log )
## Formula:
## bike_count ~ month + hour + poly(temperature, degree = 3) + wind_speed +
## visibility + dew + solar + rainfall + snowfall
## Data: train.p
##
## AIC BIC logLik deviance df.resid
## NA NA NA NA 4269
##
##
## Overdispersion parameter for nbinom2 family (): 3.71
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.6398461 0.0539453 123.08 < 2e-16 ***
## month5 0.1550457 0.0344568 4.50 6.80e-06 ***
## month6 0.2845797 0.0427726 6.65 2.87e-11 ***
## month7 0.0103328 0.0525515 0.20 0.844123
## month8 -0.1210934 0.0539623 -2.24 0.024830 *
## month9 0.0802032 0.0410570 1.95 0.050765 .
## month10 0.3858171 0.0320035 12.06 < 2e-16 ***
## month11 0.4146592 0.0369253 11.23 < 2e-16 ***
## hour1 -0.2364749 0.0546135 -4.33 1.49e-05 ***
## hour2 -0.5672537 0.0548426 -10.34 < 2e-16 ***
## hour3 -0.8963052 0.0555463 -16.14 < 2e-16 ***
## hour4 -1.2941545 0.0549593 -23.55 < 2e-16 ***
## hour5 -1.2377618 0.0558382 -22.17 < 2e-16 ***
## hour6 -0.4688265 0.0552406 -8.49 < 2e-16 ***
## hour7 0.2105909 0.0555062 3.79 0.000148 ***
## hour8 0.6161298 0.0558433 11.03 < 2e-16 ***
## hour9 0.0621336 0.0580158 1.07 0.284180
## hour10 -0.3323503 0.0608565 -5.46 4.73e-08 ***
## hour11 -0.3678135 0.0645079 -5.70 1.19e-08 ***
## hour12 -0.2946353 0.0666299 -4.42 9.78e-06 ***
## hour13 -0.2307972 0.0669570 -3.45 0.000567 ***
## hour14 -0.3205165 0.0656518 -4.88 1.05e-06 ***
## hour15 -0.1668450 0.0633620 -2.63 0.008458 **
## hour16 0.0102945 0.0608231 0.17 0.865597
## hour17 0.3101779 0.0588425 5.27 1.35e-07 ***
## hour18 0.7510536 0.0571758 13.14 < 2e-16 ***
## hour19 0.5957831 0.0562003 10.60 < 2e-16 ***
## hour20 0.5553654 0.0554031 10.02 < 2e-16 ***
## hour21 0.5770352 0.0546088 10.57 < 2e-16 ***
## hour22 0.4786738 0.0548400 8.73 < 2e-16 ***
## hour23 0.2052227 0.0545917 3.76 0.000170 ***
## poly(temperature, degree = 3)1 33.9182267 1.8507807 18.33 < 2e-16 ***
## poly(temperature, degree = 3)2 -7.5115579 0.6748325 -11.13 < 2e-16 ***
## poly(temperature, degree = 3)3 -5.6975371 0.5927848 -9.61 < 2e-16 ***
## wind_speed -0.0162001 0.0101921 -1.59 0.111950
## visibility 0.0001531 NA NA NA
## dew -0.0304058 0.0021653 -14.04 < 2e-16 ***
## solar 0.1099859 0.0209383 5.25 1.50e-07 ***
## rainfall -0.1157364 0.0040745 -28.41 < 2e-16 ***
## snowfall -0.0980377 0.0255153 -3.84 0.000122 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glm.full3nb <- glmmTMB(bike_count ~ month + hour + poly(temperature, degree=3)
+ dew + solar + rainfall+ snowfall, data = train.p, family="nbinom2")
summary(glm.full3nb)
## Family: nbinom2 ( log )
## Formula:
## bike_count ~ month + hour + poly(temperature, degree = 3) + dew +
## solar + rainfall + snowfall
## Data: train.p
##
## AIC BIC logLik deviance df.resid
## 63028.9 63277.3 -31475.5 62950.9 4271
##
##
## Overdispersion parameter for nbinom2 family (): 3.66
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.886552 0.054655 126.00 < 2e-16 ***
## month5 0.215965 0.034991 6.17 6.75e-10 ***
## month6 0.362799 0.043449 8.35 < 2e-16 ***
## month7 0.184775 0.054551 3.39 0.000706 ***
## month8 0.073191 0.056507 1.30 0.195228
## month9 0.244065 0.043318 5.63 1.76e-08 ***
## month10 0.453604 0.032264 14.06 < 2e-16 ***
## month11 0.381302 0.036399 10.48 < 2e-16 ***
## hour1 -0.237790 0.054932 -4.33 1.50e-05 ***
## hour2 -0.576276 0.055159 -10.45 < 2e-16 ***
## hour3 -0.906657 0.055794 -16.25 < 2e-16 ***
## hour4 -1.304172 0.055261 -23.60 < 2e-16 ***
## hour5 -1.255152 0.056071 -22.39 < 2e-16 ***
## hour6 -0.476170 0.055407 -8.59 < 2e-16 ***
## hour7 0.193851 0.055768 3.48 0.000509 ***
## hour8 0.598098 0.056188 10.64 < 2e-16 ***
## hour9 0.039317 0.058416 0.67 0.500911
## hour10 -0.345654 0.061237 -5.64 1.66e-08 ***
## hour11 -0.371640 0.064917 -5.72 1.04e-08 ***
## hour12 -0.299812 0.066994 -4.48 7.63e-06 ***
## hour13 -0.238073 0.067189 -3.54 0.000395 ***
## hour14 -0.335038 0.065722 -5.10 3.44e-07 ***
## hour15 -0.179450 0.063316 -2.83 0.004594 **
## hour16 -0.002918 0.060565 -0.05 0.961576
## hour17 0.303001 0.058442 5.18 2.16e-07 ***
## hour18 0.745131 0.056939 13.09 < 2e-16 ***
## hour19 0.594710 0.056082 10.60 < 2e-16 ***
## hour20 0.554264 0.055415 10.00 < 2e-16 ***
## hour21 0.578150 0.054889 10.53 < 2e-16 ***
## hour22 0.483622 0.055211 8.76 < 2e-16 ***
## hour23 0.203223 0.054968 3.70 0.000218 ***
## poly(temperature, degree = 3)1 36.299012 1.873175 19.38 < 2e-16 ***
## poly(temperature, degree = 3)2 -7.521284 0.680379 -11.05 < 2e-16 ***
## poly(temperature, degree = 3)3 -5.676800 0.596561 -9.52 < 2e-16 ***
## dew -0.041416 0.002357 -17.57 < 2e-16 ***
## solar 0.097043 0.020944 4.63 3.60e-06 ***
## rainfall -0.116998 0.004094 -28.58 < 2e-16 ***
## snowfall -0.104882 0.025385 -4.13 3.60e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
detach("package:glmmTMB", unload=TRUE)
In the negative binomial model, "wind_speed" and "visibility" are not significant (the standard error of "visibility" could not even be computed, shown as NA). They are therefore removed, and a reduced negative binomial model (glm.full3nb) is fitted.
Comparing the three models, the negative binomial model has the smallest AIC, although its RMSE is slightly higher than that of the others.
#Poisson
model_performance(glm.full3)
#Quasipoisson
model_performance(glm.full3q)
#Negative Binomial
model_performance(glm.full3nb)
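The `model_performance()` calls above come from the `performance` package; if it is unavailable, an equivalent manual comparison might be sketched as follows (note that `AIC()` returns `NA` for the quasipoisson fit):

```r
# Training-set RMSE for each fitted model
rmse <- function(model) {
  sqrt(mean((train.p$bike_count - predict(model, type = "response"))^2))
}

data.frame(model = c("Poisson", "Quasipoisson", "Negative Binomial"),
           AIC   = c(AIC(glm.full3), AIC(glm.full3q), AIC(glm.full3nb)),
           RMSE  = c(rmse(glm.full3), rmse(glm.full3q), rmse(glm.full3nb)))
```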
In the predicted vs. residuals plots, the Poisson and quasipoisson models look almost identical: most residuals lie between -50 and 50, which seems reasonable relative to the predicted values. In contrast, the residuals of the negative binomial model are very large relative to its predicted values.
par(mfrow=c(1,2))
#Poisson
plot(predict(glm.full3,type="response"),residuals(glm.full3), main="Poisson",
ylab="Residuals", xlab="Predicted", col="skyblue")
abline(h=0,lty=1,col="gray")
lines(lowess(predict(glm.full3,type="response"),residuals(glm.full3)),lwd=2,lty=2,col="blue")
#Quasipoisson
plot(predict(glm.full3q,type="response"),residuals(glm.full3q), main="Quasipoisson",
ylab="Residuals", xlab="Predicted", col="skyblue")
abline(h=0,lty=1,col="gray")
lines(lowess(predict(glm.full3q,type="response"),residuals(glm.full3q)),lwd=2,lty=2,col="blue")
par(mfrow=c(1,1))
#Negative Binomial
plot(predict(glm.full3nb,type="response"),residuals(glm.full3nb), main="Negative Binomial",
ylab="Residuals", xlab="Predicted", col="skyblue")
abline(h=0,lty=1,col="gray")
lines(lowess(predict(glm.full3nb,type="response"),residuals(glm.full3nb)),lwd=2,lty=2,col="blue")
pred.glm3 <- predict(glm.full3, type = "response", newdata = test.p)
pred.glmq <- predict(glm.full3q, type = "response", newdata = test.p)
pred.glmnb <- predict(glm.full3nb, type = "response", newdata = test.p)
#Poisson
R.glm3 <- cor(pred.glm3, test.p$bike_count)^2
#Quasipoisson
R.glmq <- cor(pred.glmq, test.p$bike_count)^2
#Negative Binomial
R.glmnb <- cor(pred.glmnb, test.p$bike_count)^2
Conclusion: In theory, the negative binomial model should be the most reasonable of the three: it does not suffer from the overdispersion problem of the Poisson model, and unlike the quasipoisson model, its predictors are significant. However, its residuals are the largest of the three models, which may be caused by remaining correlation or collinearity among "month", "hour", and the weather-related variables.
ggplot(data = bikes, aes(x = highdemand, y = bike_count, fill = highdemand))+
geom_boxplot(alpha=0.2) + labs(fill ="Demand > 500", y="Bike Count") + xlab("")
The boxplots from the graphical-analysis section show that high demand does not differ noticeably across weekdays, so the weekday variable is left out of the initial model.
glm_bike1 <- glm(highdemand ~ month + hour + temperature + humidity + wind_speed + visibility
+ dew + solar + rainfall + snowfall + holiday, family = "binomial", data = train)
summary(glm_bike1)
##
## Call:
## glm(formula = highdemand ~ month + hour + temperature + humidity +
## wind_speed + visibility + dew + solar + rainfall + snowfall +
## holiday, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7805 -0.1922 0.0363 0.2574 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.938e-01 1.103e+00 -0.629 0.529431
## month2 2.771e-01 3.462e-01 0.800 0.423519
## month3 2.449e+00 3.357e-01 7.295 2.99e-13 ***
## month4 2.660e+00 3.712e-01 7.166 7.70e-13 ***
## month5 3.018e+00 4.186e-01 7.211 5.57e-13 ***
## month6 4.044e+00 4.865e-01 8.312 < 2e-16 ***
## month7 2.358e+00 5.214e-01 4.523 6.11e-06 ***
## month8 1.880e+00 5.407e-01 3.478 0.000506 ***
## month9 3.101e+00 4.808e-01 6.451 1.11e-10 ***
## month10 4.348e+00 3.981e-01 10.922 < 2e-16 ***
## month11 4.573e+00 3.444e-01 13.278 < 2e-16 ***
## month12 1.486e+00 3.265e-01 4.550 5.36e-06 ***
## hour1 -9.038e-01 2.506e-01 -3.607 0.000310 ***
## hour2 -2.552e+00 2.632e-01 -9.695 < 2e-16 ***
## hour3 -4.817e+00 4.020e-01 -11.982 < 2e-16 ***
## hour4 -1.987e+01 2.989e+02 -0.066 0.946995
## hour5 -1.974e+01 2.997e+02 -0.066 0.947477
## hour6 -1.995e+00 2.612e-01 -7.639 2.19e-14 ***
## hour7 -3.403e-01 2.602e-01 -1.308 0.190981
## hour8 2.758e+00 3.291e-01 8.381 < 2e-16 ***
## hour9 5.675e-01 3.270e-01 1.735 0.082670 .
## hour10 -1.118e+00 3.300e-01 -3.388 0.000704 ***
## hour11 -1.565e+00 3.554e-01 -4.404 1.06e-05 ***
## hour12 -1.163e+00 3.926e-01 -2.961 0.003064 **
## hour13 -1.168e+00 3.898e-01 -2.996 0.002740 **
## hour14 -1.578e+00 3.749e-01 -4.209 2.56e-05 ***
## hour15 -7.640e-01 3.669e-01 -2.082 0.037303 *
## hour16 -5.292e-01 3.483e-01 -1.519 0.128699
## hour17 9.493e-01 3.410e-01 2.784 0.005377 **
## hour18 2.857e+00 3.389e-01 8.430 < 2e-16 ***
## hour19 1.268e+00 3.242e-01 3.911 9.20e-05 ***
## hour20 1.139e+00 3.069e-01 3.710 0.000207 ***
## hour21 1.085e+00 3.047e-01 3.561 0.000369 ***
## hour22 1.128e+00 2.933e-01 3.844 0.000121 ***
## hour23 6.659e-01 2.793e-01 2.384 0.017118 *
## temperature 7.894e-02 4.486e-02 1.760 0.078437 .
## humidity -6.828e-02 1.169e-02 -5.839 5.26e-09 ***
## wind_speed -1.453e-01 5.653e-02 -2.571 0.010138 *
## visibility 8.618e-05 1.117e-04 0.772 0.440350
## dew 1.152e-01 4.548e-02 2.534 0.011283 *
## solar 1.052e+00 1.542e-01 6.825 8.78e-12 ***
## rainfall -1.703e+00 1.956e-01 -8.707 < 2e-16 ***
## snowfall -8.547e-01 2.326e-01 -3.675 0.000238 ***
## holidayNo Holiday 1.195e+00 2.457e-01 4.864 1.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9397.0 on 6784 degrees of freedom
## Residual deviance: 2865.5 on 6741 degrees of freedom
## AIC: 2953.5
##
## Number of Fisher Scoring iterations: 17
Visibility is not significant in this model, so it is removed in the next one.
glm_bike2 <- update(glm_bike1, . ~ . -visibility)
summary(glm_bike2)
##
## Call:
## glm(formula = highdemand ~ month + hour + temperature + humidity +
## wind_speed + dew + solar + rainfall + snowfall + holiday,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7745 -0.1922 0.0363 0.2583 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.52373 1.08287 -0.484 0.628631
## month2 0.28685 0.34600 0.829 0.407079
## month3 2.48574 0.33223 7.482 7.32e-14 ***
## month4 2.72475 0.36124 7.543 4.60e-14 ***
## month5 3.10162 0.40408 7.676 1.65e-14 ***
## month6 4.14295 0.46897 8.834 < 2e-16 ***
## month7 2.50865 0.48286 5.195 2.04e-07 ***
## month8 2.04771 0.49464 4.140 3.48e-05 ***
## month9 3.24488 0.44315 7.322 2.44e-13 ***
## month10 4.44995 0.37572 11.844 < 2e-16 ***
## month11 4.60928 0.34128 13.506 < 2e-16 ***
## month12 1.50369 0.32571 4.617 3.90e-06 ***
## hour1 -0.90589 0.25050 -3.616 0.000299 ***
## hour2 -2.55525 0.26311 -9.712 < 2e-16 ***
## hour3 -4.82368 0.40243 -11.986 < 2e-16 ***
## hour4 -19.89581 299.00803 -0.067 0.946948
## hour5 -19.77708 299.89406 -0.066 0.947420
## hour6 -2.00082 0.26107 -7.664 1.80e-14 ***
## hour7 -0.35047 0.25982 -1.349 0.177359
## hour8 2.75119 0.32913 8.359 < 2e-16 ***
## hour9 0.55466 0.32662 1.698 0.089467 .
## hour10 -1.12421 0.32980 -3.409 0.000652 ***
## hour11 -1.56726 0.35521 -4.412 1.02e-05 ***
## hour12 -1.15873 0.39234 -2.953 0.003143 **
## hour13 -1.16772 0.38932 -2.999 0.002706 **
## hour14 -1.56993 0.37476 -4.189 2.80e-05 ***
## hour15 -0.76049 0.36665 -2.074 0.038063 *
## hour16 -0.53006 0.34830 -1.522 0.128040
## hour17 0.94612 0.34109 2.774 0.005540 **
## hour18 2.85488 0.33885 8.425 < 2e-16 ***
## hour19 1.26808 0.32420 3.911 9.18e-05 ***
## hour20 1.13670 0.30676 3.706 0.000211 ***
## hour21 1.08006 0.30448 3.547 0.000389 ***
## hour22 1.12831 0.29322 3.848 0.000119 ***
## hour23 0.66314 0.27916 2.375 0.017525 *
## temperature 0.07659 0.04485 1.708 0.087682 .
## humidity -0.06981 0.01155 -6.045 1.50e-09 ***
## wind_speed -0.14263 0.05637 -2.530 0.011403 *
## dew 0.11438 0.04557 2.510 0.012064 *
## solar 1.04284 0.15361 6.789 1.13e-11 ***
## rainfall -1.71191 0.19541 -8.761 < 2e-16 ***
## snowfall -0.86357 0.23320 -3.703 0.000213 ***
## holidayNo Holiday 1.19967 0.24586 4.879 1.06e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9397.0 on 6784 degrees of freedom
## Residual deviance: 2865.8 on 6742 degrees of freedom
## AIC: 2951.8
##
## Number of Fisher Scoring iterations: 17
The AIC decreases slightly (from 2953.5 to 2951.8). Next, we try removing "temperature", which also appears insignificant in this model.
glm_bike3 <- update(glm_bike2, . ~ . -temperature)
summary(glm_bike3)
##
## Call:
## glm(formula = highdemand ~ month + hour + humidity + wind_speed +
## dew + solar + rainfall + snowfall + holiday, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7809 -0.1940 0.0372 0.2593 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.073700 0.563336 1.906 0.056654 .
## month2 0.352329 0.342329 1.029 0.303381
## month3 2.515200 0.331156 7.595 3.07e-14 ***
## month4 2.799388 0.358021 7.819 5.32e-15 ***
## month5 3.198618 0.399836 8.000 1.25e-15 ***
## month6 4.263817 0.463821 9.193 < 2e-16 ***
## month7 2.647507 0.475746 5.565 2.62e-08 ***
## month8 2.185986 0.487735 4.482 7.40e-06 ***
## month9 3.357322 0.438122 7.663 1.82e-14 ***
## month10 4.502452 0.374430 12.025 < 2e-16 ***
## month11 4.658019 0.339528 13.719 < 2e-16 ***
## month12 1.491848 0.325124 4.589 4.46e-06 ***
## hour1 -0.906277 0.250302 -3.621 0.000294 ***
## hour2 -2.561136 0.263187 -9.731 < 2e-16 ***
## hour3 -4.876339 0.407576 -11.964 < 2e-16 ***
## hour4 -21.360835 486.595363 -0.044 0.964985
## hour5 -21.335727 488.027075 -0.044 0.965129
## hour6 -2.013862 0.261070 -7.714 1.22e-14 ***
## hour7 -0.364803 0.259759 -1.404 0.160203
## hour8 2.720974 0.327573 8.306 < 2e-16 ***
## hour9 0.520355 0.326096 1.596 0.110554
## hour10 -1.159124 0.329442 -3.518 0.000434 ***
## hour11 -1.595459 0.354734 -4.498 6.87e-06 ***
## hour12 -1.177347 0.392124 -3.002 0.002678 **
## hour13 -1.169983 0.390166 -2.999 0.002712 **
## hour14 -1.546614 0.374837 -4.126 3.69e-05 ***
## hour15 -0.726932 0.367113 -1.980 0.047689 *
## hour16 -0.494561 0.348061 -1.421 0.155345
## hour17 0.970022 0.340665 2.847 0.004407 **
## hour18 2.868751 0.338530 8.474 < 2e-16 ***
## hour19 1.279611 0.323058 3.961 7.47e-05 ***
## hour20 1.144665 0.305692 3.745 0.000181 ***
## hour21 1.080319 0.303612 3.558 0.000373 ***
## hour22 1.123475 0.292356 3.843 0.000122 ***
## hour23 0.661801 0.278816 2.374 0.017615 *
## humidity -0.087563 0.005371 -16.303 < 2e-16 ***
## wind_speed -0.144830 0.056299 -2.573 0.010096 *
## dew 0.188817 0.014036 13.453 < 2e-16 ***
## solar 1.090947 0.151043 7.223 5.09e-13 ***
## rainfall -1.674997 0.193273 -8.666 < 2e-16 ***
## snowfall -0.870396 0.235197 -3.701 0.000215 ***
## holidayNo Holiday 1.225778 0.246854 4.966 6.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9397.0 on 6784 degrees of freedom
## Residual deviance: 2870.2 on 6743 degrees of freedom
## AIC: 2954.2
##
## Number of Fisher Scoring iterations: 18
However, the AIC of model 3 (2954.2) is higher than that of model 2 (2951.8), so we continue with model 2. Its residual deviance is 2865.8 on 6742 residual degrees of freedom; for a well-specified binomial model with dispersion 1, the ratio of these two values should be close to 1. Here the ratio deviates clearly from 1, so the fixed-dispersion assumption is questionable (strictly speaking, the data appear underdispersed rather than overdispersed).
Because the dispersion deviates from 1, we refit with the quasibinomial family, which estimates the dispersion parameter from the data instead of fixing it at 1.
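The same deviance-based dispersion check used for the Poisson model can be sketched for the chosen logistic model (assuming `glm_bike2` has been fitted as above):

```r
# Dispersion statistic = residual deviance / residual degrees of freedom;
# values far from 1 suggest the fixed binomial dispersion is questionable.
glm_bike2$deviance / glm_bike2$df.residual
```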
glm_bike_q1 <- glm(highdemand ~ month + hour + temperature + humidity + wind_speed + visibility
+ dew + solar + rainfall + snowfall + holiday,
family = "quasibinomial", data = train)
drop1(glm_bike_q1, test="F")
## Single term deletions
##
## Model:
## highdemand ~ month + hour + temperature + humidity + wind_speed +
## visibility + dew + solar + rainfall + snowfall + holiday
## Df Deviance F value Pr(>F)
## <none> 2865.5
## month 11 3513.1 138.5039 < 2.2e-16 ***
## hour 23 4672.4 184.8110 < 2.2e-16 ***
## temperature 1 2870.1 10.7735 0.0010349 **
## humidity 1 2893.5 65.8492 5.739e-16 ***
## wind_speed 1 2871.9 14.9870 0.0001093 ***
## visibility 1 2865.8 0.7211 0.3958098
## dew 1 2869.4 9.2837 0.0023209 **
## solar 1 2913.4 112.7553 < 2.2e-16 ***
## rainfall 1 3078.4 500.9053 < 2.2e-16 ***
## snowfall 1 2889.3 55.9760 8.269e-14 ***
## holiday 1 2889.1 55.4114 1.099e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Visibility again appears insignificant in the model and is removed in the next one.
glm_bike_q2 <- update(glm_bike_q1, . ~ . -visibility)
drop1(glm_bike_q2, test="F")
## Single term deletions
##
## Model:
## highdemand ~ month + hour + temperature + humidity + wind_speed +
## dew + solar + rainfall + snowfall + holiday
## Df Deviance F value Pr(>F)
## <none> 2865.8
## month 11 3518.5 139.5950 < 2.2e-16 ***
## hour 23 4674.5 185.0041 < 2.2e-16 ***
## temperature 1 2870.2 10.3745 0.0012837 **
## humidity 1 2895.7 70.2623 < 2.2e-16 ***
## wind_speed 1 2872.0 14.6202 0.0001327 ***
## dew 1 2869.7 9.1196 0.0025383 **
## solar 1 2913.4 111.9571 < 2.2e-16 ***
## rainfall 1 3083.3 511.7643 < 2.2e-16 ***
## snowfall 1 2889.9 56.7016 5.734e-14 ***
## holiday 1 2889.5 55.6375 9.808e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the quasibinomial model does not provide an AIC value, we will evaluate it in the next section by comparing the correct classification rates.
fitted(glm_bike2) %>% round(digits =2)
fitted(glm_bike_q2) %>% round(digits =2)
fitted.bikes.disc <- ifelse(fitted(glm_bike2) < 0.5,
yes=0, no=1)
b.obs.fit <- data.frame(obs = train$highdemand, fitted = fitted.bikes.disc)
table(obs = b.obs.fit$obs,
fit = b.obs.fit$fitted)
## fit
## obs 0 1
## 0 2955 314
## 1 279 3237
In the training data, 2,955 observations were correctly classified as low-demand and 3,237 as high-demand. Meanwhile, 314 low-demand observations were misclassified as high-demand, and 279 high-demand observations as low-demand.
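The implied training correct rate follows directly from the table (a sketch using the objects defined above):

```r
# Share of observations on the diagonal of the confusion table
conf <- table(obs = b.obs.fit$obs, fit = b.obs.fit$fitted)
sum(diag(conf)) / sum(conf)   # (2955 + 3237) / 6785, roughly 0.91
```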
fitted.bikes.disc_q <- ifelse(fitted(glm_bike_q2) < 0.5,
yes=0, no=1)
b.obs.fit_q <- data.frame(obs = train$highdemand, fitted = fitted.bikes.disc_q)
table(obs = b.obs.fit_q$obs,
fit = b.obs.fit_q$fitted)
## fit
## obs 0 1
## 0 2955 314
## 1 279 3237
Both models produce identical fitted classifications, so only one of them (glm_bike2) is used for the prediction step.
pred.glmb <- ifelse(predict(glm_bike2, type = "response", newdata = test) < 0.5,
yes=0, no=1)
matrix.glmb <- confusionMatrix(as.factor(test$highdemand), as.factor(pred.glmb))
matrix.glmb
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 736 61
## 1 76 807
##
## Accuracy : 0.9185
## 95% CI : (0.9043, 0.9311)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8366
##
## Mcnemar's Test P-Value : 0.2317
##
## Sensitivity : 0.9064
## Specificity : 0.9297
## Pos Pred Value : 0.9235
## Neg Pred Value : 0.9139
## Prevalence : 0.4833
## Detection Rate : 0.4381
## Detection Prevalence : 0.4744
## Balanced Accuracy : 0.9181
##
## 'Positive' Class : 0
##
acc.glmb <- matrix.glmb$overall['Accuracy']
Conclusion: Even though the model reached about 0.92 accuracy, the overdispersion in the data still needs to be considered. There might be another way to better fit the binomial model family, or another model might suit this dataset better.
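One quick way to gauge overdispersion is the Pearson dispersion statistic (sum of squared Pearson residuals divided by the residual degrees of freedom); values well above 1 suggest overdispersion. A minimal sketch on a built-in dataset (`mtcars`, purely illustrative, not our bike data):

```r
# Fit a small binomial GLM on illustrative data
fit <- glm(am ~ mpg, data = mtcars, family = binomial)

# Pearson dispersion statistic; >> 1 hints at overdispersion
# (note: for ungrouped 0/1 responses the statistic is less informative
# than for grouped binomial counts)
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi
```

Applied to `glm_bike2`, the same two lines would give a rough check of how severe the overdispersion is before committing to the quasibinomial fit.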
From the graphical analysis section, the high-demand factor showed differences across all variables, so we'll include all of them in the SVM model. First, we separate the data and build the model.
library(e1071)
svm <- train[, c("highdemand", "month", "wday", "hour", "temperature", "humidity", "wind_speed",
"visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]
svm.test <- test[, c("month", "wday", "hour", "temperature", "humidity", "wind_speed",
"visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]
svm.test_truth <- test[, c("highdemand", "month", "wday", "hour", "temperature", "humidity",
"wind_speed", "visibility", "dew", "solar", "rainfall", "snowfall",
"holiday")] %>% pull (highdemand)
bikes_svm <- svm(highdemand ~ ., svm, kernel = "linear", scale = TRUE, cost =10)
plot(bikes_svm, svm, temperature ~ dew)
pred.svm <- predict(bikes_svm, svm.test)
matrix.svm <- confusionMatrix(svm.test_truth, pred.svm)
matrix.svm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 739 58
## 1 70 813
##
## Accuracy : 0.9238
## 95% CI : (0.9101, 0.936)
## No Information Rate : 0.5185
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8473
##
## Mcnemar's Test P-Value : 0.3309
##
## Sensitivity : 0.9135
## Specificity : 0.9334
## Pos Pred Value : 0.9272
## Neg Pred Value : 0.9207
## Prevalence : 0.4815
## Detection Rate : 0.4399
## Detection Prevalence : 0.4744
## Balanced Accuracy : 0.9234
##
## 'Positive' Class : 0
##
acc.svm <- matrix.svm$overall['Accuracy']
Conclusion: On the test data, the SVM model correctly labeled 739 observations low-demand and 813 high-demand. Meanwhile, 58 low-demand observations were misclassified as high-demand, while 70 high-demand ones were misclassified as low-demand. Overall, the model reached about 92% accuracy.
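The `cost = 10` above was a fixed choice; `e1071` can also grid-search the cost via cross-validation. A sketch on `iris` (illustrative data, not our bike set):

```r
library(e1071)

# 10-fold CV over a small grid of cost values for a linear SVM;
# the best cost found here may well differ from the fixed cost = 10 used above
set.seed(42)
tuned <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                  cost = 2^(-1:3))
tuned$best.parameters$cost
```

The same call with `highdemand ~ .` and the training frame would let the data pick the cost instead of us.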
Load the necessary libraries and select the required variables from the train and test data sets for the model.
library(nnet)
library(gamlss.add)
bikes_nn <- train[, c("highdemand", "month", "wday", "hour", "temperature", "humidity",
"wind_speed", "visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]
bikes_nn.test <- test[, c("highdemand", "month", "wday", "hour", "temperature", "humidity",
"wind_speed", "visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]
str(bikes_nn)
## 'data.frame': 6785 obs. of 13 variables:
## $ highdemand : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ wday : Factor w/ 7 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 10 11 ...
## $ temperature: num -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -6.5 -3.5 ...
## $ humidity : int 37 38 39 40 36 37 35 38 27 24 ...
## $ wind_speed : num 2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 0.5 1.2 ...
## $ visibility : int 2000 2000 2000 2000 2000 2000 2000 2000 1928 1996 ...
## $ dew : num -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -22.4 -21.2 ...
## $ solar : num 0 0 0 0 0 0 0 0 0.23 0.65 ...
## $ rainfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ snowfall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ holiday : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
bikes %>%
ggplot(aes (x=temperature, fill = highdemand)) +
geom_histogram(position="stack")
The histogram of temperature versus bike counts shows that most high-demand events occur at higher temperatures, while low-demand events are concentrated at lower temperatures.
demand_net <- nnet(highdemand ~ ., data = bikes_nn, size = 10, maxit=100,
range=0.1, decay=5e-4, MaxNWts=65123)
## # weights: 511
## initial value 4622.897089
## iter 10 value 3506.628868
## iter 20 value 2827.514316
## iter 30 value 1978.542868
## iter 40 value 1911.970149
## iter 50 value 1711.506443
## iter 60 value 1524.615852
## iter 70 value 1446.100993
## iter 80 value 1379.359325
## iter 90 value 1304.954104
## iter 100 value 1265.941197
## final value 1265.941197
## stopped after 100 iterations
plot(demand_net)
demand_net
## a 49-10-1 network with 511 weights
## inputs: month2 month3 month4 month5 month6 month7 month8 month9 month10 month11 month12 wday2 wday3 wday4 wday5 wday6 wday7 hour1 hour2 hour3 hour4 hour5 hour6 hour7 hour8 hour9 hour10 hour11 hour12 hour13 hour14 hour15 hour16 hour17 hour18 hour19 hour20 hour21 hour22 hour23 temperature humidity wind_speed visibility dew solar rainfall snowfall holidayNo Holiday
## output(s): highdemand
## options were - entropy fitting decay=5e-04
pred.nn <- predict(demand_net, bikes_nn.test, type = "class")
b_nn <- table(pred=pred.nn, true=bikes_nn.test$highdemand)
b_nn
## true
## pred 0 1
## 0 735 63
## 1 62 820
matrix.nn <- confusionMatrix(as.factor(bikes_nn.test$highdemand), as.factor(pred.nn))
matrix.nn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 735 62
## 1 63 820
##
## Accuracy : 0.9256
## 95% CI : (0.912, 0.9377)
## No Information Rate : 0.525
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8508
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9211
## Specificity : 0.9297
## Pos Pred Value : 0.9222
## Neg Pred Value : 0.9287
## Prevalence : 0.4750
## Detection Rate : 0.4375
## Detection Prevalence : 0.4744
## Balanced Accuracy : 0.9254
##
## 'Positive' Class : 0
##
acc.nn <- matrix.nn$overall['Accuracy']
Check the model accuracy: the neural network achieved about 0.93 accuracy.
ROC Curve
library(ROCR)
pred_raw <- predict(demand_net, bikes_nn.test, decision.values=TRUE, type="raw")
pred <- ROCR::prediction(pred_raw, bikes_nn.test$highdemand)
perf <- ROCR::performance(pred, "tpr", "fpr")
plot(perf, lwd=2, col="blue")
abline(a=0, b=1)
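The ROC curve above can be summarized in a single number, the AUC, which ROCR computes via `performance()` with measure `"auc"`. A tiny worked example with toy scores and labels (not our model's predictions):

```r
library(ROCR)

# Four toy predictions: one negative (0.4) outscores one positive (0.35),
# so 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)
pred <- ROCR::prediction(scores, labels)
auc  <- ROCR::performance(pred, "auc")@y.values[[1]]
auc  # 0.75
```

Replacing `scores` and `labels` with `pred_raw` and `bikes_nn.test$highdemand` would give the AUC of the neural network's curve plotted above.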
Conclusion: On the test data, the Neural Network model correctly labeled 735 observations low-demand and 820 high-demand. Meanwhile, 62 low-demand observations were misclassified as high-demand, while 63 high-demand ones were misclassified as low-demand. Overall, the model reached about 93% accuracy.
To split the data into training and testing sets, we used an 80/20 split (80% of the data to train the models and 20% to test them). We aim to find the model that maximizes predictive performance for bike-rental demand.
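A minimal sketch of such an 80/20 split in base R (`mtcars` stands in for the bike data; the project itself may have used a different splitting routine):

```r
# Random 80/20 split: sample row indices for training, keep the rest for testing
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]
c(train = nrow(train_set), test = nrow(test_set))
```

Setting the seed makes the split reproducible, so every model is trained and tested on exactly the same rows.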
Five models - Linear Model, Generalized Additive Model (GAM), GLM with Poisson, GLM with quasi-Poisson, and GLM with negative binomial - predict the bike_count variable. We used R-squared as the measure of fit for three reasons: (1) it is easy to compute, (2) it penalizes large errors and thus guards against extreme outliers, (3) it is symmetric. The last point matters in our case because both over- and underestimating demand for bike rentals would damage the customer experience of the service.
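For reference, one common definition of R-squared is 1 - SSE/SST; the sketch below checks it against `summary()`'s value for a linear model on `mtcars` (the exact formula used for the `R.lm`, `R.gam`, etc. values here may differ):

```r
# R-squared by hand: 1 - residual sum of squares / total sum of squares
fit  <- lm(mpg ~ wt, data = mtcars)
y    <- mtcars$mpg
yhat <- fitted(fit)
R2   <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
all.equal(R2, summary(fit)$r.squared)  # TRUE
```

For the non-linear models, the same formula can be applied to the observed and predicted counts on the test set.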
The R-squared values are provided in the table below.
R.df <- data.frame(Linear_Model = round(R.lm,3), GAM = round(R.gam,3),
GLM_Poisson = round(R.glm3, 3), GLM_Quasipoisson = round(R.glmq, 3),
GLM_Negative_Binomial = round(R.glmnb, 3))
R.df
Three other models - GLM with binomial, Support Vector Machine, and Neural Network - predict the binary highdemand variable. As the measure of predictive performance for these models, we used the accuracy value from the confusion matrix (the proportion of predictions that match the actual class).
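Accuracy is simply the diagonal of the confusion matrix over its total; recomputed here from the binomial GLM's test-set matrix printed earlier:

```r
# Confusion matrix from the binomial GLM on the test set (rows = prediction)
tab <- matrix(c(736, 76, 61, 807), nrow = 2)
acc <- sum(diag(tab)) / sum(tab)
round(acc, 4)  # 0.9185, matching the reported accuracy
```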
acc.df <- data.frame(GLM_Binomial = round(acc.glmb,3),
SVM = round(acc.svm,3), Neural_Network = round(acc.nn, 3))
acc.df
The Neural Network has the highest accuracy (0.93). Thus, we recommend this model for forecasting whether the level of demand is low or high.
Predict Bike Counts
Predicting the bike count helps the business understand how many bikes should be prepared for user needs and how many maintenance workers should be deployed on a given day under certain weather and other conditions. When the number of bikes and the number of workers match the number of bikes actually used that day, the company can offer its users a quality customer experience - a smooth and convenient ride.
Among the count models, the GLM with quasi-Poisson has the highest R-squared (79%). However, all variables in the quasi-Poisson model are insignificant (see modeling section 5.3), so the model is not applicable. The better model for the business to predict with is the GLM with negative binomial, which explains 72.6% of the variance in the data.
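For readers who want to reuse the negative binomial approach, the call pattern is `MASS::glm.nb()`; a sketch on MASS's built-in `quine` data (not our bike data or our model formula):

```r
library(MASS)

# Negative binomial regression: school days absent modeled by age group and sex;
# theta is the estimated dispersion parameter of the NB distribution
nb_fit <- glm.nb(Days ~ Age + Sex, data = quine)
nb_fit$theta
```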
Predict If-Highdemand
The assumption here is that if the hourly rental bike count is over 500, it counts as a high-demand situation; if it is below 500, it counts as low-demand.
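Deriving the binary target from the count is a one-liner; a sketch on toy counts (whether a count of exactly 500 should be "high" is left open by the definition above, so `>` is an assumption here):

```r
# Toy hourly counts; "over 500" -> high-demand (1), otherwise low-demand (0)
bike_count <- c(120, 480, 500, 735, 1020)
highdemand <- factor(as.integer(bike_count > 500))
table(highdemand)
```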
Predicting high or low demand helps the business allocate its resources, from the number of workers to the number of bikes that need repair. During low demand, for example, the company could assign more workers to bike maintenance and repair in the workshop, leaving only a few outside monitoring the bike-sharing stations. During high demand, in contrast, the company needs more bikes in service and more workers outside monitoring the stations.
The accuracy values of the three classification models - GLM with binomial, Support Vector Machine, and Neural Network - show that the Neural Network is the best model for the company to predict high or low demand, with about 92.6% accuracy.
Modeling - a Never-Ending Story
We split the original data set into training and testing portions; in other words, the accuracy of the models was tested on past data. However, business operations, market competition, and customer behavior change every day. Moreover, we based our predictions on historical weather data; future predictions will rely on weather forecasts, so their accuracy depends on forecast quality. We therefore advise our client to make weekly rather than monthly forecasts.
Furthermore, the models included in this project do not cover every modeling method in the field. The results should be seen as a reference rather than the best possible solution. Modeling is, in any case, a never-ending story - newer data and a deeper understanding of the predictors would certainly improve and change the models.
Since the NetLogoR package had been removed from the CRAN repository, we downloaded the 20 packages required for its installation from source.
#install.packages("ellipsis")
#install.packages("xfun")
#install.packages("data.table", dependencies=TRUE)
#install.packages("devtools")
#require(devtools)
#packageurl_1 <- "https://cran.r-project.org/src/contrib/fastmatch_1.1-0.tar.gz"
#packageurl_2 <-"https://stat.ethz.ch/CRAN//src/contrib/DBI_1.1.1.tar.gz"
#packageurl_3 <-"https://cran.r-project.org/src/contrib/bit_4.0.4.tar.gz"
#packageurl_4 <-"https://cran.r-project.org/src/contrib/blob_1.2.1.tar.gz"
#packageurl_5 <-"https://cran.r-project.org/src/contrib/plogr_0.2.0.tar.gz"
#packageurl_6 <- "https://cran.microsoft.com/snapshot/2017-08-06/src/contrib/bit64_0.9-7.tar.gz"
#packageurl_7 <-"https://cran.r-project.org/src/contrib/RSQLite_2.2.7.tar.gz"
#packageurl_8 <- "https://cran.r-project.org/src/contrib/Archive/Require/Require_0.0.12.tar.gz"
#packageurl_9 <- "https://cran.r-project.org/src/contrib/fpCompare_0.2.3.tar.gz"
#packageurl_10 <- "https://cran.r-project.org/src/contrib/sp_1.4-5.tar.gz"
#packageurl_11 <- "https://cran.r-project.org/src/contrib/raster_3.4-10.tar.gz"
#packageurl_12 <- "https://cran.r-project.org/src/contrib/reproducible_1.2.7.tar.gz"
#packageurl_13 <- "https://cran.r-project.org/src/contrib/htmlTable_2.2.1.tar.gz"
#packageurl_14 <- "https://cran.r-project.org/src/contrib/viridis_0.6.1.tar.gz"
#packageurl_15 <- "https://cran.r-project.org/src/contrib/Hmisc_4.5-0.tar.gz"
#packageurl_16 <- "https://cran.r-project.org/src/contrib/Archive/SpaDES.tools/SpaDES.tools_0.3.6.tar.gz"
#packageurl_17 <- "https://cran.r-project.org/src/contrib/Archive/NetLogoR/NetLogoR_0.3.7.tar.gz"
#install.packages(packageurl_1, repos=NULL, type="source")
#install.packages(packageurl_2, repos=NULL, type="source")
#install.packages(packageurl_3, repos=NULL, type="source")
#install.packages(packageurl_4, repos=NULL, type="source")
#install.packages(packageurl_5, repos=NULL, type="source")
#install.packages(packageurl_6, repos=NULL, type="source")
#install.packages(packageurl_7, repos=NULL, type="source")
#install.packages(packageurl_8, repos=NULL, type="source")
#install.packages(packageurl_9, repos=NULL, type="source")
#install.packages(packageurl_10, repos=NULL, type="source")
#install.packages(packageurl_11, repos=NULL, type="source")
#install.packages(packageurl_12, repos=NULL, type="source")
#install.packages(packageurl_13, repos=NULL, type="source")
#install.packages(packageurl_14, repos=NULL, type="source")
#install.packages(packageurl_15, repos=NULL, type="source")
#install.packages(packageurl_16, repos=NULL, type="source")
#install.packages(packageurl_17, repos=NULL, type="source")
Loading the required libraries.
#library(NetLogoR)
library(stringr)
library(minpack.lm)
The simulation was run in a separate R script included in our project zip file (ABM_simulation.R). The code is commented out here because it takes a long time to complete all the computations. During the simulation, the plotting that slows down the computation was disabled.
Simulate all combinations of pubs [1-20] and agents [40-100] with nested for loops.
#for (number_agents in 40:100) {
#for (number_pubs in 1:20) {
### 1. DEFINE THE SPACE AND AGENTS ###
## simulations parameters
#simtime<-200 ## duration time of the simulation
#gridSize_x<-15 ## number of patches in the grid
#gridSize_y<-15
#displacement_normal<-0.1 ## speed of moving agents
#displacement_pub<-0.01 ## speed in the pub
#plot_data_out<-numeric()
## world set up
#w1 <- createWorld(minPxcor = 0, maxPxcor = gridSize_x-1, minPycor = 0,
# maxPycor = gridSize_y-1)
#x_pub <- randomPxcor(w1,number_pubs)
#y_pub <- randomPycor(w1,number_pubs)
#w1 <- NLset(world = w1, agents = patches(w1), val = 0)
#w1 <- NLset(world = w1, agents = patch(w1, x_pub, y_pub), val = 1)
## agents set up
#t1 <- createTurtles(n = number_agents, coords = randomXYcor(w1, n = number_agents),
# breed="S", color="black")
#t1 <- NLset(turtles = t1, agents = turtle(t1, who = 0), var = "breed", val = "I")
#t1 <- NLset(turtles = t1, agents = turtle(t1, who = 0), var = "color", val = "red")
#t1 <- turtlesOwn(turtles = t1, tVar = "displacement", tVal = displacement_normal)
#plot(w1, axes = 0, legend = FALSE, par(bty = 'n'))
#points(t1, col = of(agents = t1, var = "color"), pch = 20)
### 2. THE SIMULATION TIME LOOP ###
#for (time in 1:simtime) {
# t1 <- fd(turtles = t1, dist=t1$displacement, world = w1, torus = TRUE, out = FALSE)
# t1 <- right(turtles = t1, angle = sample(-20:20, 1, replace = F))
# plot(w1, axes = 0, legend = FALSE, par(bty = 'n'))
# points(t1, col = of(agents = t1, var = "color"), pch = 20)
# meet <- turtlesOn(world = w1, turtles = t1, agents = t1[of(agents = t1,
# var = "breed")=="I"])
# t1 <- NLset(turtles = t1, agents = meet, var = "breed", val = "I")
# t1 <- NLset(turtles = t1, agents = meet, var = "color", val = "red")
## agents that enter a pub spend more time there
# pub <- turtlesOn(world = w1, turtles = t1, agents = patch(w1, x_pub, y_pub))
# t1 <- NLset(turtles = t1, agents = turtle(t1, who = pub$who),
# var = "displacement", val = displacement_pub) ## if enters the pub
# t1 <- NLset(turtles = t1, agents = turtle(t1, who = t1[-(pub$who+1)]$who),
# var = "displacement", val = displacement_normal) ## if exits the pub
# Sys.sleep(0.1)
## store time-course data for plotting in the end
# contaminated_counter <- sum(str_count(t1$color, "red"))
# tmp_data <- c(time,contaminated_counter)
# plot_data_out <- rbind(plot_data_out, tmp_data)
#}
### 3. PLOTTING AND FITTING SIMULATED DATA ###
## perform non-linear curve fitting of the data
#df <- as.data.frame(plot_data_out)
#names(df) <- c("time","contaminated_counter")
#x <- df$time
#y <- df$contaminated_counter
## give arbitrarily initial guesses to 3,4,600,1000 and fit with 4-parameters logistic equation
#model <- nlsLM(y ~ d + (a-d) / (1 + (x/c)^b), start = list(a = 3, b = 4, c = 600, d = 1000))
## make a line with the fitting model that goes through the data
#fit_x <- data.frame(x = seq(min(x),max(x),len = simtime))
#fit_y <- predict(model, newdata = fit_x)
#fit_df <- as.data.frame(cbind(fit_x,fit_y))
#names(fit_df) <- c("x","y")
#fitted_function <- data.frame(x = seq(min(x),max(x),len = simtime))
#lines(fitted_function$x,predict(model,fitted_function = fitted_function))
## store summary statistics in a vector to be appended after each iteration to the output file
#simulation_run_name <- paste0("sim_",number_agents,"_",number_pubs)
#varied_params <- c(number_agents,number_pubs)
#summary_stat <- c(simulation_run_name, varied_params, as.vector(model$m$getPars()) )
## save summary statistics of all the performed simulations in file with:
## Simulation ID, parameters for simulation, outcome of the curve
#write.table(as.data.frame(t(summary_stat)), "./summary_stat.csv", sep = ",", col.names = FALSE,
# row.names=FALSE, append = TRUE)
#}
#}
As a result, a dataset of 1,220 (20 x 61) simulations was built and exported as "summary_stat.csv".
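The curve fit inside the simulation loop is a 4-parameter logistic, y = d + (a - d)/(1 + (x/c)^b). A sketch of the same fit on synthetic data using base `nls()` (the simulation used `minpack.lm::nlsLM`, which tolerates poorer starting values; the data and starting values below are made up for illustration):

```r
# Synthetic sigmoid data with known parameters a = 3, b = 4, c = 60, d = 100
set.seed(1)
x <- 1:200
y <- 100 + (3 - 100) / (1 + (x / 60)^4) + rnorm(length(x), sd = 1)

# Fit the 4-parameter logistic; starting values must be in the right ballpark
fit <- nls(y ~ d + (a - d) / (1 + (x / c)^b),
           start = list(a = 2, b = 3, c = 55, d = 95))
coef(fit)  # c (the inflection point) should land close to the true 60
```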
library(abc)
obs_data <- c(3, 2, 1143, 1655)
sim_param <- read.table(file="sim_param.dat",header=FALSE, sep = ",")
sim_data <- read.table(file="sim_data.dat",header=FALSE, sep = ",")
To compare the simulated dataset with the "observed data" (3, 2, 1143, 1655), we used an ABC model with a "log" transformation applied to the parameter values and the "neuralnet" method.
result <- abc(target=obs_data,
param=sim_param,
sumstat=sim_data,
tol=0.005,
transf=c("log"),
method="neuralnet")
## 12345678910
## 12345678910
Below are the graphical report generated by the ABC model and a table of values from the inferred posterior distribution.
plot(result, param=sim_param)
result$adj.values
## V1 V2
## [1,] 51.08730 9.155542
## [2,] 51.06800 9.155598
## [3,] 51.08820 9.306829
## [4,] 52.12616 1.111161
## [5,] 51.06791 9.329076
## [6,] 51.08854 9.157650
## [7,] 51.06988 9.285002